(54) A conversational portal for providing conversational browsing and multimedia broadcast on
demand 



(57) A system and method for providing conversa- 
tional (multimodal) access to information over a com- 
munications network from any location, at any time, uti- 
lizing any type of client/access, through a conversation- 
al (multimodal) portal. In one aspect, a conversational 
portal comprises a conversational (multi-modal) brows- 
er that is capable of conducting multi-modal dialog with 
client/access devices having varying input/output (I/O) 
modalities. The conversational browser retrieves infor- 
mation (such as content pages, applications) from an 
information source (for example, content server) in response
to a request from a requesting client/access device
and then serves the retrieved information to the requesting
client/access device in a format that is compatible
with the I/O modalities of the requesting client/access
device. In another aspect, the conversational portal
provides multimedia access on demand. The conversational
portal comprises an audio indexing system for
segmenting and indexing audio and multimedia data ob- 
tained from an information source, as well as a multi- 
media database for storing the indexed audio and multi- 
media data. A subscribing user can compose and main- 
tain a broadcast program wherein the user specifies 
which types, and in what order, different segments 
(news, radio, etc.) stored in the database are played 
back/broadcasted to the user. 
Description

Technical Field of the Invention 

[0001] The present invention relates generally to sys- 
tems and methods for accessing information over a 
communication network and, more particularly, to a con- 
versational portal employing a conversational browser 
to provide services such as conversational browsing 
and multimedia access on demand. 

Background of the Invention 

[0002] The computing world is evolving towards an 
era where billions of interconnected pervasive clients 
will communicate with powerful information servers. In- 
deed, this millennium will be characterized by the avail- 
ability of multiple information devices that make ubiqui- 
tous information access an accepted fact of life. This ev- 
olution towards billions of pervasive devices being inter- 
connected via the Internet, wireless networks or spon- 
taneous networks (such as Bluetooth and Jini) will rev- 
olutionize the principles underlying man-machine inter- 
action. [Jini is a trademark of Sun Microsystems Inc.] In 
the near future, personal information devices will offer 
ubiquitous access, bringing with them the ability to cre- 
ate, manipulate and exchange any information any- 
where and anytime using interaction modalities most 
suited to the user's current needs and abilities. Such devices
will include familiar access devices such as conventional
telephones, cell phones, smart phones, pocket organizers,
PDAs and PCs, which vary widely in the interface peripherals
they use to communicate with the user. At the same time, as
this evolution progresses, users will demand a consistent
look, sound and feel in the
user experience provided by these various information 
devices. 

[0003] The increasing availability of information, 
along with the rise in the computational power available 
to each user to manipulate this information, brings with 
it a concomitant need to increase the bandwidth of man- 
machine communication. The ability to access informa- 
tion via a multiplicity of appliances, each designed to 
suit the user's specific needs and abilities at any given 
time, necessarily means that these interactions should 
exploit all available input and output (I/O) modalities to 
maximize the bandwidth of man-machine communica- 
tion. Indeed, users will come to demand such multi-mo- 
dal interaction in order to maximize their interaction with 
information devices in hands-free, eyes-free environ- 
ments. 

[0004] Unfortunately, the current infrastructure is not 
entirely configured for providing seamless, multi-modal 
access to information. Indeed, although a plethora of in- 
formation can be accessed from servers over a network 
using an access device (for example, personal information
and corporate information available on private networks
and public information accessible via a global
computer network such as the Internet), the availability
of such information may be limited by the modality of the
client/access device or the platform-specific software
applications with which the user is interacting to obtain
such information.

[0005] By way of example, currently, there are various 
types of portals (or gateways) that may be accessed on 
various networks to obtain desired information. For instance,
well-known WWW (world wide web) portals include Yahoo!
(which is open to the Internet and open to users) and AOL
(which is open to the Internet and allows subscribing users to
access proprietary content). [Yahoo! is a trademark of Yahoo!
Inc.] These portals typically include a directory of Web sites,
a search engine, news, weather information, e-mail, stock
quotes, etc.
Unfortunately, typically only a client/access device hav- 
ing full GUI capability can take advantage of such Web 
portals for accessing information. 

[0006] Other portals include wireless portals that are
typically offered by telephone companies or wireless
carriers (which provide proprietary content to subscribing
users). These wireless portals may be accessed by
a client/access device having limited GUI capabilities
declaratively driven by languages such as WML (wireless
markup language) or CHTML (compact hypertext
markup language). These wireless portals, however, do
not offer seamless multi-modal access such as voice
and GUI, since a separate voice mode is used for human
communication and a separate and distinct mode is
used for WAP (wireless application protocol) access and
WML browsing.

[0007] In addition, IVR (interactive voice response)
services and telephone companies can provide voice
portals (which provide proprietary content to subscribing
users) having only speech I/O capabilities. With a voice
portal, a user may access an IVR service or perform
voice browsing using a speech browser. Unfortunately,
a client/access device having only GUI capability would
not be able to directly access information from a voice
portal. Likewise, a client/access device having only
speech I/O would not be able to access information in
a GUI modality. Therefore, the bandwidth of man-machine
communication is currently limited, for example,
by the available I/O modalities of the client/access device
and the format of the content stored in the server
providing the information.

[0008] Other information sources that are currently
available include the various service providers that provide
access to radio and television (TV) programs (for
example, broadcasters, cable and satellite service providers).
Many of these service providers offer interactive
TV and broadcast programs on demand. The conventional
methods for providing interactive TV and broadcast
programs on demand, however, all rely on selection
by the user of a particular program from a given set of
catalogues. For example, a user can select to begin
viewing a specific movie at a given time by individually
ordering the movie. Alternatively, the user can join new
broadcasts starting at a certain time (for example, every
quarter hour).

[0009] With interactive TV, using services such as 
WebTV, etc., the user can follow links associated with
the program (for example, URL to web pages) to access 
related meta-information (that is, any relevant informa- 
tion such as additional information or raw text of a press 
release or pages of involved companies or parties, etc.). 
Other interactive TV uses include, for example, sending 
feedback to the broadcaster who can poll the viewer's 
opinion, selecting a video or film to view from a central 
bank of films, or modifying the end of the movie or pro- 
gram based on the viewer's request. Both WebTV and 
Interactive TV services utilize a set-top box or special 
set-top unit that connects to a television set. In addition, 
pay-per-view television, as well as TV services where 
viewers can vote (via telephone or the web) to select the 
next movie, can be considered as other forms of inter- 
active TV. In all such cases, however, the level of per- 
sonalization that may be achieved, for example, is very 
limited. 

[0010] On the Internet, various web sites (for exam- 
ple, Bloomberg TV or Broadcast.com) provide broad- 
casts from existing radio and television stations using 
streaming sound or streaming media techniques. Web
broadcasts that use web-based video streaming and audio
streaming rely on pre-compiled video or radio clips that the
user can download and play on a local machine such as a
television or personal computer using, for example,
RealNetworks Player or Microsoft Windows Media Player.
[RealNetworks is a registered trademark of RealNet- 
works Inc., Windows Media is a trademark of Microsoft 
Corporation.] Indeed, in a WebTV interactive TV envi- 
ronment, the downloaded streamed program can be 
played on the TV. 

[0011] In teletext systems, catalogues of ASCII meta
information are downloaded with a TV program to the 
user's TV or set-top box. The user can then select de- 
sired items that are later downloaded. Eventually, new 
set-top boxes will offer the capability to store com- 
pressed versions of a program on a local hard disk or 
memory system to offer services such as pause or in- 
stant replay during a program. 
[0012] Although the multimedia services described 
above allow users to download programs of interest, 
these services do not offer the user the capability to ac- 
cess a true broadcast on demand service, where the us- 
er is able to compose his radio or TV program based on 
his interest. 

[0013] There is a need therefore for a system and
method that provides multi-modal access to any information
source (for example, the WWW), from any location,
at any time, through any type of client/access device,
so as to retrieve desired information and/or build
a personalized broadcast program on demand, as well 
as manage and modify the program at any time. 



DISCLOSURE OF THE INVENTION 

[0014] The present invention is directed to systems
and methods employing a conversational (multi-modal)
portal to provide conversational (multi-modal) access to
information over a communications network from any location,
at any time, utilizing any type of client/access device. In
one aspect of the present invention, a conversational
portal comprises a conversational (multi-modal) browser
that is capable of conducting multi-modal dialog with
client/access devices having varying input/output (I/O)
modalities. The conversational browser retrieves information
(such as content pages, applications) from an
information source (for example, a content server located
on the Internet or an intranet/extranet) in response
to a request from a requesting client/access device and
then serves or presents the retrieved information to the
requesting client/access device in a format that is compatible
with the I/O modalities of the requesting client/
access device.

[0015] In another aspect of the present invention, the
content pages and applications provided by the content
servers are multi-modal, implemented using CML (conversational
markup language). In one embodiment,
CML is implemented in a modality-independent format
using a plurality of conversational gestures that allow
the conversational interactions (multi-modal dialog) to
be described independently of the platform, or the modality
of the device or browser rendering/processing the
content. The conversational portal can serve CML documents
directly to an access device running a conversational
browser for local processing/rendering of the
CML documents.

[0016] In another aspect of the invention, the conversational
portal provides multi-channel access to the content
pages and applications by employing a transcoder
that converts the modality-independent format (CML
document) into at least one modality-specific format (for
example, HTML, VoiceXML) based on the detected I/O
modalities of the requesting client/access device.
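
The channel selection described in this aspect can be sketched in Python as follows. This is only an illustration under stated assumptions: the modality labels, the mapping table and the function name select_target_format are hypothetical and are not taken from the disclosed transcoder.

```python
# Hypothetical mapping from a client's detected I/O modalities to a
# modality-specific markup format. The labels are illustrative assumptions.
FORMAT_BY_MODALITIES = {
    frozenset({"speech"}): "VoiceXML",
    frozenset({"gui"}): "HTML",
    frozenset({"restricted_gui"}): "WML",
}


def select_target_format(modalities, has_conversational_browser=False):
    """Return the format in which a CML document should be served."""
    # A client running its own conversational browser can process CML itself,
    # so no transcoding is needed in that case.
    if has_conversational_browser:
        return "CML"
    key = frozenset(modalities)
    if key in FORMAT_BY_MODALITIES:
        return FORMAT_BY_MODALITIES[key]
    raise ValueError(f"no transcoding rule for modalities: {sorted(key)}")
```

A speech-only telephone client would thus be served VoiceXML, whereas a client with a local conversational browser would receive the CML document unchanged.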

[0017] In yet another aspect, the conversational portal
provides multimedia access on demand. The conversational
portal comprises an audio indexing system for
segmenting and indexing audio and multimedia data obtained
from an information source, as well as a multimedia
database for storing the indexed audio and multimedia
data. In response to a user request, the conversational
browser obtains desired segments from the
multimedia database and presents such segments to the client
based on the I/O capabilities of the client. The conversational
portal allows a subscribing user to compose
and maintain a broadcast program wherein the user
specifies which types, and in what order, different segments
(news, radio, etc.) are played back/broadcasted
to the user. The broadcast program on demand service
offered by the conversational portal can be accessed
from any location at any time, using any type of access
device.
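
By way of a minimal sketch, composing a broadcast program from indexed segments could look like the following Python fragment. The Segment record, its fields and the compose_program function are hypothetical names introduced for illustration; the actual structure of the multimedia database is not specified here.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One indexed segment stored in the multimedia database (hypothetical)."""
    segment_type: str  # for example "news" or "radio"
    title: str


def compose_program(database, type_order):
    """Return segments grouped by type, in the order the user specified."""
    program = []
    for wanted in type_order:
        # Keep the database order within each requested type.
        program.extend(s for s in database if s.segment_type == wanted)
    return program


database = [
    Segment("radio", "Morning show"),
    Segment("news", "Headlines"),
    Segment("weather", "Forecast"),
    Segment("news", "Business report"),
]
# The user asks for news first, then radio; weather is not requested.
program = compose_program(database, ["news", "radio"])
titles = [s.title for s in program]
```

Here the user's stored preference is simply an ordered list of segment types, which the portal could keep per subscriber and re-apply whenever new segments are indexed.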

BRIEF DESCRIPTION OF THE DRAWINGS

[0018] The present invention will now be described, 
by way of example only, with reference to preferred em- 
bodiments thereof as illustrated in the following draw- 
ings: 

Fig. 1 is a block diagram of a system for accessing
information via a conversational portal according to 
one embodiment of the present invention; 

Fig. 2 is a block diagram of a system for accessing
information via a conversational portal according to 
another embodiment of the present invention; 

Figs. 3a and 3b comprise a flow diagram of a meth- 
od for accessing information according to one as- 
pect of the present invention; 

Fig. 4 is a block diagram of an architecture of a conversational
(multi-modal) browser that may be employed
in connection with the present invention; and

Fig. 5 is a block diagram of an architecture of another
conversational (multi-modal) browser that may be 
employed in connection with the present invention. 

DETAILED DESCRIPTION OF THE INVENTION 

[0019] The present invention is directed to systems 
and methods employing a "conversational portal" (com- 
prising a "conversational browser") to provide "conver- 
sational access" to information over a computer network 
from any location, at any time, utilizing any type of client/
access device. It is to be understood that the term "con- 
versational" used herein refers to seamless multi-modal 
dialog (information exchanges) between user and ma- 
chine and between devices or platforms of varying mo- 
dalities (I/O capabilities), based on the capability of the 
access device/channel, preferably, using open, interop- 
erable protocols and standards. Multi-modal dialog 
comprises modalities such as speech-only (for example,
VoiceXML), visual-only (GUI) (for example, HTML
(hypertext markup language)), restricted GUI (for example,
WML (wireless markup language), CHTML
(compact HTML), HDML (handheld device markup lan- 
guage)), and a combination of such modalities (for ex- 
ample, speech and GUI). In addition, each modality (or 
combination of modalities) may be implemented as a full 
NL (natural language) user interface, resulting in a uni- 
versal conversational user interface (CUI). 

[0020] The concepts of "conversational" interactions
(or conversational computing) and "conversational
browsing" are discussed in greater detail below as they
relate to the exemplary embodiments described herein.
Furthermore, detailed discussions of such concepts
may be found, for example, in International Appl. No.
PCT/US99/22927, filed on October 1, 1999, entitled:
"Conversational Computing Via Conversational Virtual
Machine", International Appl. No. PCT/US99/22925,
filed on October 1, 1999, entitled: "System and Method
For Providing Network Coordinated Conversational
Services", and International Appl. No. PCT/US99/23008,
filed on October 1, 1999, entitled "Conversational
Browser and Conversational Systems," all of
which are commonly assigned, and fully incorporated
herein by reference (each of these International Applications
designates the United States and claims priority
from U.S. Patent Application Serial Numbers
60/102,957 filed October 2, 1998 and 60/117,595 filed
January 27, 1999, which disclosures are also expressly
incorporated herein by reference).

[0021] It is to be understood that the systems and
methods described herein may be implemented in various
forms of hardware, software, firmware, special purpose
processors, or a combination thereof. In particular,
the present invention is preferably implemented as an
application comprising program instructions that are
tangibly embodied on a program storage device (for example,
magnetic floppy disk, RAM, ROM, CD ROM,
etc.) and executable by any device or machine comprising
suitable architecture. It is to be further understood
that, because some of the constituent system components
and process steps depicted in the accompanying
Figures are preferably implemented in software, the actual
connections between such components and steps
may differ depending upon the manner in which the
present invention is programmed. Given the teachings
herein, one of ordinary skill in the related art will be able
to contemplate these and similar implementations or
configurations of the present invention.

[0022] Referring now to Fig. 1, a block diagram illustrates
a system 10 according to one embodiment of the
present invention for providing conversational access to
information over a computer network. In general, the
system 10 comprises a conversational portal 11 that
processes multi-modal requests received from one or
more client/access devices 12-16 and, in response,
fetches desired content pages, services, and applications
over a network 17 (for example, the Internet, an
Intranet, a LAN (local area network), or an ad hoc network
such as Bluetooth) from one or more content servers
18 (for example, Web servers). The conversational
portal 11 may comprise a web server and/or an IVR
server that is associated with the service provider of the
conversational portal 11. As described in detail below,
the conversational portal 11 comprises a mechanism for
conducting conversational dialog with a requesting client/access
device based on the I/O modality (or modalities)
of the client/access device.

[0023] Each client/access device 12-16 is capable of
establishing communication over a network 29 (for example,
wireless, PSTN, LAN, Internet) to the conversational
portal 11. It is to be appreciated that the conversational
portal 11 may be accessed via a phone number
or a URL, independently of the modality. For instance,
depending on the configuration of the client/access device
12-16, connection may be made to the conversational
portal 11 using a dial-up connection through a modem
or through an ISP for WML (or an address that is
accessible directly off a cell phone or other wireless device),
an HTML browser client, a VoiceXML browser client
via VoIP (voice over internet protocol), or other conversational
protocols as described in the above-incorporated
International Appln. Nos. PCT/US99/22927 and
PCT/US99/22925. Similarly, a phone number can be
used to provide direct access to the conversational portal
11 for all these modalities (that is, a direct phone call
or ISP function offered directly by the conversational
portal 11).

[0024] The content servers 18 maintain corresponding
content/business logic 19 and perform appropriate
database and legacy system operations (for example,
via CGI scripts, etc.). The content pages and applications
in database 19 may be implemented in one or more
legacy formats such as HTML, HDML, XML, WML, and
any Speech ML format (such as the recent VoiceXML
standard that has been proposed as a standard for declaratively
describing the conversational UI for, for example,
speech browsers and IVR platforms (see
http://www.voicexml.org)).

[0025] In a preferred embodiment, the content pages 
and applications are multi-modal, implemented using a 
CML (conversational markup language). In general, 
CML refers to any language which specifies/builds a
conversational dialog (multi-modal information ex- 
changes and interactions) to be conducted with the user 
based on the desired application. A CML document may 
be any declarative page, for example, that comprises 
the information needed to build such interaction. 
[0026] It is to be appreciated that CML documents and 
CML applications may be implemented in one of various 
manners. In a preferred embodiment, the CML content 
is implemented in a modality-independent, single au- 
thoring format using a plurality of "conversational ges- 
tures" such as described, for example, in U.S. Serial 
Number 09/544,823, filed on April 6, 2000, entitled:
"Methods and Systems For Multi-Modal Browsing and 
Implementation of A Conversational Markup Lan- 
guage", which is commonly assigned and fully incorpo- 
rated herein by reference. Briefly, conversational ges- 
tures are elementary dialog components that character- 
ize the dialog interaction with the user and provide ab- 
stract representation of the dialog independently of the 
characteristics and UI offered by the device or application
rendering the presentation material. Conversational
gestures may be implemented either declaratively (for 
example, using XML) to describe the dialog or impera- 
tively/procedurally. 

[0027] Advantageously, the use of conversational
gestures (to generate content/applications) allows conversational
interactions to be described independently
of the platform, browser, modality or capability of the device
processing or rendering the content. As described
in detail below, a multi-modal document such as a gesture-based
CML document can be processed using a
conversational (multi-modal) browser to provide tight
synchronization between the different views supported
by the multi-modal browser. Furthermore, using specific
predefined rules, the content of a gesture-based CML
document can be automatically transcoded to the modality
or modalities supported by the particular client
browser or access device. For instance, a CML document
can be converted to an appropriate declarative
language such as HTML, XHTML, or XML (for automated
business-to-business exchanges), WML for wireless
portals and VoiceXML for speech applications and IVR
systems. Indeed, as described below, the conversational
portal 11 comprises a mechanism for transcoding/
adapting the CML page or application to the particular
modality or modalities of the client/access device. Accordingly,
it is to be appreciated that regardless of the
set of conversational gestures used or the transcoding
method employed, such an approach enables a true
"multi-modal/multi-channel" conversational portal as
described herein (that is, "multi-modal" in the sense that
the conversational portal 11 can serve multi-modal documents
(such as gesture-based CML documents) to an
access device running a conversational (multi-modal)
browser for processing/rendering by the local conversational
browser, and "multi-channel" in the sense that the
conversational portal 11 can serve the content of multi-modal
CML documents to legacy browsers (for example,
HTML, VoiceXML, WML) by converting CML to the
supported modality).

[0028] In another embodiment, a multi-modal CML 
document may be implemented by incorporating a plu- 
rality of visual and aural markup languages (that is, a 
CML document that comprises sub-documents from dif- 
ferent interaction modalities). For example, a CML doc- 
ument may be implemented by embedding in a single 
document, markup elements from each of a plurality of 
represented/supported modalities (for example, 
VoiceXML and HTML tags), and using synchronizing 
tags to synchronize the different ML content (that is, to 
synchronize an action of a given command in one mo- 
dality with corresponding actions in the other supported 
modalities) on an element-by-element basis. These 
techniques are described, for example, in the above-in- 
corporated application International Appl. No. PCT/ 
US99/23008, as well as U.S. Serial Number 
09/507,526, filed on February 18, 2000, entitled: "Systems
and Methods For Synchronizing Multi-Modal Interactions,"
which is commonly assigned and fully incorporated
herein by reference.

[0029] The main difference between a gesture-based 
CML document and a CML document comprising mul- 
tiple MLs is that the gesture-based approach offers sin- 
gle authoring whereas the multiple ML approach re- 
quires multiple authoring. In addition, the gesture-based 
approach provides "tight" synchronization in multi-mo- 
dal browsing implementations, which is more difficult to 
achieve using the multiple ML approach (which often affords
"loose" synchronization). In any event, multi-modal
CML documents may be transformed to standalone
documents for specific interaction modalities using, for 
example, standard tree transformations as expressible 
in the known standards XSLT or XSL. Other transcoding 
techniques may be used such as JSP (Java server pag- 
es) or Java Beans that implement similar transforma- 
tions of the CML pages on a gesture-by-gesture basis. 
Additional transcoding techniques that may be implemented
are discussed, for example, at http://www.w3c.org.
Indeed, the implementation of multi-modal documents,
which can be transformed to documents of desired
modalities, ensures content reuse and meets the
accessibility requirements (for example, a multi-modal 
document designed with combined visual and aural mo- 
dalities can be used in environments where only one 
modality is available). 
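
A gesture-by-gesture transformation of the kind mentioned above can be sketched in Python as follows. The gesture names ("message", "select", "choice") and the sample document are hypothetical, since the CML vocabulary is defined in the incorporated applications rather than here, and the outputs are simplified fragments rather than complete, valid HTML or VoiceXML documents.

```python
import xml.etree.ElementTree as ET

# Hypothetical gesture-based source document; element names are assumptions.
CML_SOURCE = """
<cml>
  <message>Welcome to the portal</message>
  <select name="topic">
    <choice>news</choice>
    <choice>weather</choice>
  </select>
</cml>
"""


def gesture_to_html(gesture):
    """Rule set for a visual (GUI) channel."""
    if gesture.tag == "message":
        return f"<p>{gesture.text}</p>"
    if gesture.tag == "select":
        options = "".join(f"<option>{c.text}</option>" for c in gesture)
        return f'<select name="{gesture.get("name")}">{options}</select>'
    raise ValueError(f"unsupported gesture: {gesture.tag}")


def gesture_to_voice(gesture):
    """Rule set for a speech channel (VoiceXML-style fragment)."""
    if gesture.tag == "message":
        return f"<prompt>{gesture.text}</prompt>"
    if gesture.tag == "select":
        options = "".join(f"<option>{c.text}</option>" for c in gesture)
        return f'<field name="{gesture.get("name")}">{options}</field>'
    raise ValueError(f"unsupported gesture: {gesture.tag}")


def transcode(cml_text, gesture_rule):
    """Apply one rule per top-level gesture, in document order."""
    root = ET.fromstring(cml_text)
    return "".join(gesture_rule(g) for g in root)


html_view = transcode(CML_SOURCE, gesture_to_html)
voice_view = transcode(CML_SOURCE, gesture_to_voice)
```

The same single-authored source yields both views, which is the content reuse the paragraph above refers to; an XSLT stylesheet per target modality would be an alternative way to express the same rules.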

[0030] Referring again to the exemplary embodiment 
of Fig. 1, the conversational portal 11 comprises a portal
proxy/capture module 20, a portal transcoder 21, a portal
conversational browser 22, a search engine 23, a
portal speech browser 24, a database of portal applica- 
tions 25 and a database of portal directories 26. The por- 
tal conversational browser 22 is responsible for perform- 
ing functions such as fetching the desired pages, etc., 
(using any conventional transport protocol such as HTTP,
WAP, or Bluetooth) in response to client requests
and parsing and processing the declarative framework 
(including any embedded procedural specifications 
such as applets) comprising a CML page, for example, 
to implement the conversational dialog between the giv- 
en client/access device 12-16 and the conversational 
portal 11. 

[0031] It is to be appreciated that the portal conversa- 
tional browser 22 together with the CML implementation 
comprises a mechanism for translating conversational 
(multi-modal) I/O events into either (i) the corresponding 
application actions (in other modalities) or (ii) the dialogs 
that are needed to disambiguate, complete or correct 
the understanding of an input event to thereby generate 
the appropriate action. The portal conversational browser
22 will either render the conversational UI comprising
the fetched pages for presentation to the user (assuming
the access device does not comprise a local client
browser) or serve the pages to the client/access device 
12-16 for rendering/presentation by the local client 
browser. 
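
The two outcomes described above, namely a direct application action or a dialog that disambiguates or corrects the input, can be illustrated with the following Python sketch. The substring matching and the function name interpret_event are simplifying assumptions made for illustration; the disclosed browser would rely on conversational engines rather than string comparison.

```python
def interpret_event(utterance, known_commands):
    """Map an input event to an action, or to a dialog that resolves it.

    Returns ("action", command) when exactly one registered command matches,
    and ("dialog", prompt) when the input is unknown or ambiguous.
    """
    candidates = [c for c in known_commands if utterance.lower() in c.lower()]
    if len(candidates) == 1:
        return ("action", candidates[0])
    if not candidates:
        # Nothing matched: ask the user to correct or repeat the input.
        return ("dialog", "Sorry, I did not understand. Please repeat.")
    # Several commands matched: ask the user to disambiguate.
    return ("dialog", "Did you mean: " + ", ".join(candidates) + "?")
```

For example, with the commands "show news" and "show weather" registered, the input "news" resolves directly to an action, while the input "show" triggers a disambiguation dialog listing both commands.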

[0032] Although any suitable multi-modal browser
may be implemented in the conversational portal 11,
one preferred architecture for the portal conversational
browser 22 is illustrated in Fig. 4 and described in detail
in the above-incorporated U.S. Serial No. 09/507,526.
Briefly, as illustrated in Fig. 4, a conversational (multi-modal)
browser 40 comprises a plurality of mono-mode
browsers (for example, a visual browser 44 (HTML) and
a speech browser 45 (VoiceXML) as shown), a multi-modal
shell API 41 and a multi-modal shell 42 having a registration
table 43 (the multi-modal shell 42 executes on
top of any conventional operating system/platform). The
multi-modal shell 42 functions as a virtual main browser
which processes CML documents retrieved over the
network 17 from a content server 18.

[0033] The multi-modal shell 42 coordinates the information
exchange via API calls that allow each mono-mode
browser application 44, 45 to register its active
commands and corresponding actions (both inter and
intra mode processes as well as actions on other processes).
Such registration may include any relevant arguments
to perform the appropriate task(s) associated
with such commands.

[0034] The registration table 43 of the multi-modal
shell 42 is a registry that is implemented as an "n-way"
command/event-to-action registration table, wherein
each registered command or event in the table indicates
a particular action that results in each of the "n" modalities
that are synchronized and shared for the active application.
The multi-modal shell 42 parses a retrieved
CML document to build the synchronization via the registration
table 43 and send the relevant modality specific
information (for example, markup language) comprising
the CML document to each browser for rendering based
on its interaction modality (using the techniques described,
for example, in the above-incorporated application
U.S. Serial No. 09/544,823). It is to be understood
that although the conversational multi-modal browser
40 is illustrated comprising a separate browser application
for each supported modality, as well as a separate
multi-modal shell layer, it is to be appreciated that the
functionalities of these components may be merged into
one application comprising the conversational (multi-modal)
browser 40. In addition, the components of the
multi-modal browser may be distributed. For instance,
the multi-modal shell 42 may reside on the conversational
portal 11, whereas one of the browsers 44 and 45
(or both) may reside on a client access device, with the
multi-modal shell 42 providing the CML parsing and synchronization.
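
As a rough illustration of an "n-way" command/event-to-action table, the following Python sketch registers one action per modality for a command and runs all of them when the command fires. The class name, method names and example actions are hypothetical and do not reflect the multi-modal shell API of the incorporated applications.

```python
from collections import defaultdict


class RegistrationTable:
    """Hypothetical n-way command/event-to-action table."""

    def __init__(self):
        # command -> {modality: action callable}
        self._actions = defaultdict(dict)

    def register(self, command, modality, action):
        """Called by a mono-mode browser to register an action."""
        self._actions[command][modality] = action

    def dispatch(self, command, *args):
        """Run the registered action in every modality; return the results."""
        if command not in self._actions:
            raise KeyError(f"unregistered command: {command}")
        return {m: act(*args) for m, act in self._actions[command].items()}


table = RegistrationTable()
table.register("select_item", "gui", lambda item: f"highlight {item}")
table.register("select_item", "speech", lambda item: f"say 'You chose {item}'")
# One event, whatever its source modality, updates every registered view.
results = table.dispatch("select_item", "news")
```

Keeping the table in the shell rather than in each browser is what allows the views to stay synchronized even when the browsers are distributed between the portal and the client.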

[0035] Fig. 5 illustrates another preferred architecture
for the portal conversational browser 22 that may be employed
utilizing a CVM (conversational virtual machine)
when more complex conversational computing features
are required, such as described in the above-incorporated
applications International Appl. Nos.
PCT/US99/23008 and PCT/US99/22927. In the embodiment
of Fig. 5, the functionalities of the multi-modal shell 42
may be implemented in a core CVM kernel 55. A description
of the architecture depicted in Fig. 5 is provided
below.

[0036] Referring again to Fig. 1, the conversational
portal 11 comprises a search engine 23 of any suitable
conventional type comprising applications known as robots,
spiders or crawlers which search the network 17
for content pages. Various content pages may be indexed
within a searchable database of the search engine
23, that is, the portal directories database 26. Upon
receiving an interpreted query from the portal conversational
browser 22 to perform a search, the search engine
23 will execute the query and search the network 17 and
portal directories 26 to locate desired sites, content pages
and broadcasts on the content servers 18 and return
a ranked list of possible matches in CML format (for example,
resulting sites are ranked by percentage of how
close the site is to the topic that was searched as is understood
by those in the art). The ranked list is rendered
back to the user via, for example, the portal conversational
browser 22 for presentation to the user and selection
by the user via conversational dialog.
[0037] It is to be understood that the search engine 
23 will locate content pages in CML, HTML, XML or oth- 
er legacy or new language formats (although the pages 
may be converted into different modalities based on the 
I/O capabilities of the requesting client/access device). 
It is to be understood that any conventional query format 
may be utilized by the search engine 23. For instance, 
the search engine 23 may support NLU queries, or sim- 
ply keyword, Boolean and concept/attribute based que- 
ries, based on the technology available for the search 
engine. Furthermore, since the conversational portal 11
preferably provides a conversational user interface with
CML, the search engine can support any possible I/O
modality and combination of modalities. Multi-lingual
searches can also be considered using the following
method. Queries are mapped into symbolic representations
(attribute value pairs). The attribute value pairs are
used to perform a direct semantic translation (that is
not necessarily literal) to other languages. The new query
is then used to perform the search of the documents in
other languages.
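
The multi-lingual method above can be sketched as follows. This is a minimal illustration only: the toy lexicon, the keyword spotting used in place of real NLU, and the function names are all assumptions and are not part of the described system.

```python
# Hypothetical sketch of attribute-value-pair query translation.
from typing import Dict, Tuple

# Per-language lexicon: (attribute, value) -> search term (illustrative data).
LEXICON: Dict[str, Dict[Tuple[str, str], str]] = {
    "en": {("topic", "weather"): "weather forecast", ("city", "paris"): "Paris"},
    "de": {("topic", "weather"): "Wettervorhersage", ("city", "paris"): "Paris"},
}


def to_attribute_values(query: str) -> Dict[str, str]:
    """Map a source-language query to symbolic attribute-value pairs.

    Simple keyword spotting stands in for an NLU engine here.
    """
    lowered = query.lower()
    pairs: Dict[str, str] = {}
    if "weather" in lowered:
        pairs["topic"] = "weather"
    if "paris" in lowered:
        pairs["city"] = "paris"
    return pairs


def translate_query(pairs: Dict[str, str], target_lang: str) -> str:
    """Build a target-language query from the symbolic pairs.

    The translation is semantic, not word-for-word: each pair is looked
    up in the target lexicon independently of the source wording.
    """
    lexicon = LEXICON[target_lang]
    return " ".join(lexicon[(attr, value)] for attr, value in pairs.items())


pairs = to_attribute_values("What is the weather like in Paris?")
german_query = translate_query(pairs, "de")
```

The translated query would then be passed to the search engine to search documents in the target language.
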

[0038] In the "multi-channel" aspect where the con- 
versational portal 11 supports multiple channels, the 
portal transcoder 21 will be utilized to transcode fetched 
documents (that are selected by the user) to the sup- 
ported modality (or modalities) of the requesting client/ 
access device. More specifically, based on the detected 
modality (or modalities) of the requesting client/access 
device, the portal transcoder 21 will transform a multi- 
modal document (for example, a gesture-based CML 
document), which is parsed and output from the portal 
conversational browser 22, into one or more modality- 
specific formats. 

[0039] By way of example, as shown in Fig. 1, a client/
access device may be a local legacy browser such as
an HTML browser 13a, WML browser 14a, or VoiceXML
browser 15a, each running on a multi-modal or mono- 
modal device such as a personal computer (GUI and 
speech), mobile telephone (speech only or speech and 
limited GUI), smartphone (speech and limited GUI), 
PDA (limited GUI only), etc. In addition, the access de- 
vice may be a conventional telephone 16 (speech I/O 
only) that interacts with the conversational portal 11 
through the portal speech browser 24, wherein the por- 
tal speech browser 24 processes VoiceXML documents 
to provide IVR services, for example. Indeed, in the
preferred embodiment where the content is stored/
constructed in CML, it is to be appreciated that the
conversational portal 11 can directly serve any of these
channels or client/access devices by transcoding
(on-the-fly) each CML page to the supported ML. For
example, a CML document may be transformed into (1) HTML
to support Internet access (via HTTP) using a traditional
browser having a GUI modality; (2) WML to support
wireless access (via WAP) over a wireless network using a
WML browser; (3) VoiceXML to support traditional
telephone access over PSTN using a speech browser;
or (4) any other current or future MLs that may be
developed.

[0040] The portal transcoder 21 employs one or more
transcoding techniques for transforming a CML page to
one or more legacy formats. For instance, such
transformations may be performed using predetermined
transcoding rules. More specifically, such transformations
may be encapsulated in device-specific and
modality-specific XSL stylesheets (such as described in
the above-incorporated applications U.S. Serial No.
09/507,526 and U.S. Serial No. 09/544,823). Furthermore,
as indicated above, other transcoding techniques
may be used such as JSP or Java Beans that implement
similar transformations of the CML pages on a
gesture-by-gesture basis.
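
As a rough illustration of gesture-by-gesture transcoding, the sketch below maps each gesture of a page to a fragment in the target markup language through a rule table. The gesture names ("message", "select"), the dictionary representation of a page and the rule table are hypothetical; they are not the CML vocabulary or the XSL rules of the incorporated applications, and the output is a set of fragments rather than complete HTML or VoiceXML documents.

```python
# Hypothetical sketch: rule-table transcoding, one gesture at a time.
from xml.sax.saxutils import escape


def message_to_html(g):
    return f"<p>{escape(g['text'])}</p>"


def message_to_vxml(g):
    return f"<prompt>{escape(g['text'])}</prompt>"


def select_to_html(g):
    options = "".join(f"<option>{escape(c)}</option>" for c in g["choices"])
    return f"<label>{escape(g['prompt'])}<select>{options}</select></label>"


def select_to_vxml(g):
    choices = "".join(f"<choice>{escape(c)}</choice>" for c in g["choices"])
    return f"<menu><prompt>{escape(g['prompt'])}</prompt>{choices}</menu>"


# (gesture type, target modality) -> transcoding rule
RULES = {
    ("message", "html"): message_to_html,
    ("message", "voicexml"): message_to_vxml,
    ("select", "html"): select_to_html,
    ("select", "voicexml"): select_to_vxml,
}


def transcode(gestures, target):
    """Transcode a list of gestures into fragments of the target markup."""
    return "\n".join(RULES[(g["type"], target)](g) for g in gestures)


page = [
    {"type": "message", "text": "Welcome"},
    {"type": "select", "prompt": "Pick a section", "choices": ["News", "Sports"]},
]
html_fragment = transcode(page, "html")
vxml_fragment = transcode(page, "voicexml")
```

Supporting a further modality under this scheme amounts to adding one rule per gesture type, which is the property the per-modality stylesheets described above are meant to provide.
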

[0041] It is to be appreciated that the portal transcoder 
21 performs other functions such as mapping back any 
user interaction in a given modality to the CML
representation (for synchronized multi-modal interactions,
the interaction in the one modality will then be reflected
across all the other synchronized modalities). It is to be
further appreciated that the functionalities of the portal
transcoder 21 may be incorporated within the portal
conversational browser 22. For instance, with the
architecture of the conversational browser 40 of Fig. 4,
the multi-modal shell 42 will perform functions such as
dynamic transcoding of multi-modal documents to
modality-specific representations and synchronization
between the different modalities.

[0042] In the "multi-modal" aspect where the conver- 
sational portal 11 serves multi-modal CML pages for 
rendering by a local conversational (multi-modal) 
browser 12a, the portal transcoder 21 is not utilized
since any required transcoding/synchronizing functions
are performed by the local conversational browser 12a
on the client side. Indeed, it is to be appreciated that in
the case of the multi-modal client/access device 12
running a local conversational (multi-modal) browser 12a
(having an architecture as described above with respect
to Fig. 4), the conversational portal 11 will serve a
fetched CML document directly to the local conversational
(multi-modal) browser 12a, wherein the CML document is
dynamically transcoded (via, for example, the multi-modal
shell) into different synchronized modalities (for example,
WML and VoiceXML documents that are tightly synchronized
for a multi-modal WAP browser (that is, a micro-browser
for the WML modality), or HTML and VoiceXML for a tightly
synchronized conversational (multi-modal) browser
comprising a speech browser (local or remote) and an HTML
browser).
[0043] In both the "multi-channel" and "multi-modal"
aspects, it is to be appreciated that the conversational
portal 11 detects the channel and the capability of the
client browser and/or access device to determine the
modality (presentation format) into which a CML document
is to be converted, if necessary. By way of example, the
access channel or modality of the client/access device may
be determined by (i) the type of query or the address
requested (for example, a query for a WML page implies that
the client is a WML browser), (ii) the access channel (for 
example a telephone access implies voice only, a GPRS 
network access implies voice and data capability, and a 
WAP communication implies that access is WML), (iii) 
user preferences (a user may be identified by the calling 
number, calling IP, biometric, password, cookies, etc.) 
and/or (iv), in the case of the conversational browser 
client, registration protocols as described in the above- 
incorporated International Appln. Nos. PCT/ 
US99/22927 and PCT/US99/22925. 
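
The detection criteria (i) to (iv) can be sketched as a small decision function. The field names, the channel labels and, in particular, the precedence among the four criteria are assumptions made for illustration; the text lists the criteria without ranking them.

```python
# Hypothetical sketch: choosing target modalities for a connection.
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Connection:
    channel: str                                   # "pstn", "wap", "gprs" or "http"
    requested_url: str = ""                        # address requested by the client
    user_preference: Optional[str] = None          # from caller ID, cookie, etc.
    registered_modalities: Optional[Tuple[str, ...]] = None  # from registration


# (ii) Access channel -> implied modalities (illustrative mapping).
BY_CHANNEL = {
    "pstn": ("voicexml",),        # telephone access implies voice only
    "wap": ("wml",),              # WAP communication implies WML
    "gprs": ("voicexml", "wml"),  # GPRS implies voice and data capability
    "http": ("html",),
}


def detect_modalities(conn: Connection) -> Tuple[str, ...]:
    # (iv) A conversational browser client that registered its capabilities.
    if conn.registered_modalities:
        return tuple(conn.registered_modalities)
    # (iii) Stored user preferences.
    if conn.user_preference:
        return (conn.user_preference,)
    # (i) Type of query or address requested.
    if conn.requested_url.endswith(".wml"):
        return ("wml",)
    # (ii) Fall back to what the access channel implies.
    return BY_CHANNEL.get(conn.channel, ("html",))
```

For instance, under these assumptions a plain telephone call resolves to VoiceXML, whereas an HTTP request for a `.wml` address resolves to WML.
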
[0044] The system 10 of Fig. 1 further comprises a 
conversational proxy server 27 having a transcoder 28, 
which may be used to transcode pages/applications of 
one or more sites of a given content provider from a leg- 
acy format into CML format (and/or other legacy for- 
mats). The proxy server 27 may be directly affiliated 
with, for example, the content provider or a third-party 
contracted by the content provider, to transcode the site 
(s) of the content provider, store the transcoded site(s), 
and periodically update the stored (transcoded) content 
when the original site is modified by the content provider. 
For instance, a service provider of an HTML-based site 
may employ the transcoding services of the proxy server 
27 to convert the HTML content of the site to a CML 
format. Such transcoding is particularly applicable for 
the client/access device 12 running a conversational 
(multi-modal) browser 12a, whereby a user can conduct 
multi-modal browsing when accessing sites comprising 
documents/applications that are strictly in conventional 
ML formats. In this manner, the conversational portal 11 
can subsequently fetch such transcoded pages (for
example, CML pages) from the proxy server 27 as if such
pages were fetched directly from the sites.
[0045] The use of the proxy server 27 allows the con- 
tent provider to control the manner in which its content 
is rendered to the user (either by the portal conversa- 
tional browser 22 or a client browser), as opposed to 
relying on unknown portal transcoders for converting the
pages/applications of the content provider into one or 
more desired modalities. Indeed, it may be the case that 
the portal transcoder 21 lacks specific proprietary infor- 
mation about the particular legacy documents, applica- 
tions and/or business logic of the content provider to ad- 
equately perform such conversion (which information is 
known only by the content provider or provided by the 
content provider to the contracted third-party). 



[0046] It is to be understood that the transcoding
services of the proxy server 27 may be performed using
automatic transcoding techniques. For instance, the
transcoder 28 may transcode conventional (legacy)
structured document formats such as HTML, WML, or DB2
into a CML document using prespecified transcoding
rules. Basic composition and design rules can be imposed
(that are either proprietary or the object of a standard)
to simplify the conversion from legacy formats such
as HTML to CML (such as the transcoding rules described
in the above-incorporated International Appl. No.
PCT/US99/23008 for converting HTML to a SpeechML
(VoiceXML)). It is to be understood that other
techniques may be employed for transcoding HTML (or
other legacy ML formats) to CML, such as using extraction
of gestures and gesture patterns. For example, by
reverse engineering transcoded pages produced from
CML to HTML, a large set of HTML tag patterns can be
mapped to specific CML gestures or groups of gestures.
Details of the additional HTML tags can either be
transformed into CML patterns as well or added to the CML
page as HTML tags embedded in the page. This last
approach may be used for details that are not related to
the gestures but directly related to additional
modality-specific (in this example, HTML) rendering
information that is not worth capturing in a gesture (for
example, display of an image). In addition, the transcoder
28 may utilize meta-information that is added to legacy
pages for transcoding purposes.
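
A minimal sketch of the gesture-extraction idea follows, assuming a single known HTML tag pattern (a `<select>` with `<option>` children) that maps to a hypothetical "select" gesture, with `<img>` tags carried through as embedded HTML. The gesture names and the dictionary output are illustrative only and do not reproduce the CML syntax; a practical transcoder would need a much larger pattern set.

```python
# Hypothetical sketch: mapping HTML tag patterns to gestures.
from html.parser import HTMLParser


class GestureExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.gestures = []
        self._choices = None      # option labels collected inside <select>
        self._in_option = False

    def handle_starttag(self, tag, attrs):
        if tag == "select":
            self._choices = []
        elif tag == "option" and self._choices is not None:
            self._in_option = True
        elif tag == "img":
            # Rendering detail not worth a gesture: keep the raw HTML tag.
            self.gestures.append(
                {"type": "embedded_html", "html": self.get_starttag_text()}
            )

    def handle_data(self, data):
        if self._in_option and data.strip():
            self._choices.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "option":
            self._in_option = False
        elif tag == "select" and self._choices is not None:
            self.gestures.append({"type": "select", "choices": self._choices})
            self._choices = None


def extract_gestures(html: str):
    """Return the gestures recognised in an HTML fragment, in page order."""
    parser = GestureExtractor()
    parser.feed(html)
    parser.close()
    return parser.gestures
```
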

[0047] In addition, when the results of automatic
transcoding by the transcoder 28 are incomplete or not
accurate, or when the service provider of the proxy
server 27 wants to increase the quality of the transcoding
results, human operators can be employed to manually
review, correct and/or complete the results of the
transcoding. Indeed, until all web sites either are
universally authored in CML, follow appropriate/standard
construction rules, or add appropriate meta-information/
hints to support fully automated transcoding, the manual
review/transcode option is especially advantageous to
efficiently and accurately model sites having complex
business logic.

[0048] It is to be appreciated that, based on different 
business models, the conversational portal 11 can offer
a service to content providers 18 to have their content
pages/applications prepared or adapted in CML for better
conversational rendering. For instance, the conversational
portal 11 can offer (to registered web sites) the
option of having their existing content pages/applications
(in legacy formats) pre-transcoded to CML and
stored in the portal directory database 26, so as to
ensure that such pages can be subsequently served with
quality rendering across different modalities. Such
transcoding may be done via the proxy server 27
(assuming it is affiliated with the service provider of the
conversational portal 11). In addition, such transcoding
may be performed directly by operators of the portal
obtaining information directly from the web site via a
specific partnering/fee/business agreement. A mechanism
can be employed (for example, a crawler checking the
original site, or a notification agreement that applies
when changes occur) to detect changes of the site and
accordingly update the transcoded content of the site.
Furthermore, as discussed
above, when the results of automatic transcoding (via 
transcoder 28) are incomplete or not accurate, or when 
the service provider of the conversational portal 11 
wants to increase the quality of the transcoding results, 
human operators can be employed to manually review, 
correct and/or complete the results of the transcoding. 
Pages that are reviewed and corrected may be stored 
in the portal directories 26. In addition, parts of pages 
or patterns may be stored in the portal directories 26. 
[0049] Furthermore, the service provider of the
conversational portal 11 can provide a service of
generating, in the first instance, a "conversational" web
site of a company or individual and hosting the
conversational web site on the conversational portal 11
hardware and/or network. Indeed, the conversational portal
11 service can generate a plurality of CML pages associated
with the "conversational" web site and store such CML pages
in the portal directory database 26. Again, it is to be
understood that the service provider of the conversational
portal 11 may offer these various services based
on different business models and service offerings.
[0050] Accordingly, the portal directory database 26
may store content pages/applications of one or more
content providers, which are either pre-transcoded or
designed in CML to provide for efficient conversational
rendering. During a search process, the search engine 23
will search for requested content in the portal directories
26 in addition to the web search. Furthermore, some of
the links in the portal directories 26 can also include
conversational applications 25 (for example, multi-modal
procedural applications built on top of CVM). The
conversational applications 25 are any regular application
developed imperatively (that is, by compiling imperative
code), declaratively (that is, built with markup languages)
or a combination of both, to deliver an application with
a "conversational user interface", that is, to let the user
access and manipulate the related information at any
time, from anywhere, through any device and with the
same behaviour, by carrying a modality independent
dialog. Examples of such applications include universal
messaging (accessing and processing e-mail, fax, and
voice mail via CUI), calendaring, e-business applications,
etc. It is to be appreciated that these portal
conversational applications 25 may be directly offered by
the service provider of the conversational portal 11 or 
hosted by the conversational portal on behalf of a com- 
pany or individual. Again, all these services may be of- 
fered pursuant to various business models. 
[0051] The portal proxy/capture module 20 is an op- 
tional feature that may be incorporated within the con- 
versational portal 11 to "capture" a telephone call or 
browser connection (for example, HTTP, WAP, etc.) 
made to the conversational portal 11. For example, in a
preferred embodiment, when a client/access device
12-16 (for example, a smartphone, HTML browser,
WML browser, conversational browser) connects to the
conversational portal 11 and enters a request, the
conversational portal 11 will maintain the call/client
browser captive for any link that is provided by the
conversational portal 11 and followed by the user. More
specifically, any link that is provided by the
conversational portal 11 that results from either an
initial request through the portal 11 or from a page that
is subsequently loaded by the portal is fetched by the
portal conversational browser 22 (as opposed to the client
browser) and served to the client browser. The portal
proxy/capture module 20 will hold the client captive during
the time it takes to fetch the link, possibly transcode the
link to the appropriate modality specific markup language
(for example, WML for a WAP browser, HTML for a web
browser, VoiceXML for a speech browser (telephony access)),
and serve any fetched page to the client browser (which
can be the speech browser 24 on the server side in the
case of a telephony access).

[0052] On the other hand, in the preferred embodi- 
ment, the pages that are directly requested/entered 
manually by the user (URL explicitly entered, bookmark,
link generated by other applications) are relinquished to
the client browser for fetching the appropriate pages by
the client browser (that is, the client browser is not held
captive). It is to be understood that other policies may
be employed with respect to the call capture feature, for
example, the conversational portal 11 may capture the
call during an entire session (that is, no release at all),
or the capture period may vary based on the
circumstances (as decided by the conversational portal 11).
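
The capture policies of paragraphs [0051] and [0052] reduce to a small decision: who fetches a given page, the portal or the client browser. The sketch below encodes only the two policies named in the text; the enum and function names are hypothetical.

```python
# Hypothetical sketch: deciding whether the portal keeps the client captive.
from enum import Enum


class LinkOrigin(Enum):
    PORTAL_LINK = "portal_link"    # link provided by the portal, followed by user
    USER_ENTERED = "user_entered"  # typed URL, bookmark, link from another app


class CapturePolicy(Enum):
    PORTAL_LINKS_ONLY = "portal_links_only"  # preferred embodiment
    ENTIRE_SESSION = "entire_session"        # no release at all


def portal_fetches(origin: LinkOrigin,
                   policy: CapturePolicy = CapturePolicy.PORTAL_LINKS_ONLY) -> bool:
    """Return True if the portal conversational browser fetches the page,
    False if the request is relinquished to the client browser."""
    if policy is CapturePolicy.ENTIRE_SESSION:
        return True
    return origin is LinkOrigin.PORTAL_LINK
```

A policy whose capture period varies with the circumstances would replace the boolean rule with whatever criteria the portal operator chooses.
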
[0053] Advantageously, during periods in which the
call/client browser is held captive, the conversational
portal 11 service can continuously listen/participate in
the "conversation" and offer additional services and
provide advertisements to the user. For instance, in a
preferred embodiment, multi-modal advertisements can be
provided to a "captive" user during the time period
between page fetches from site to site (but not necessarily
the time period between page fetches of the same
application from the same server). Again, the time in which
advertisements are provided may vary based on the
policies of the conversational portal 11. It is to be
appreciated that the advertisements are a pure multi-modal
feature. Indeed, advertisements can be displayed,
rendered using audio, or both, depending on the modalities
of the client/access device. Moreover, in specific
portions of a multi-modal document (rendered by a
multi-modal browser), advertisements can be added in
frames that are separate from the content. Again, there
are various options that may be implemented by the
service provider of the conversational portal.

[0054] Referring now to Fig. 5, a block diagram
illustrates another preferred architecture of a
conversational browser that may be employed in the system
of Fig. 1. This architecture is described in greater detail
in the above-incorporated International Appln. No.
PCT/US99/23008. The conversational (multimodal) browser
50 executes on top of a CVM shell 53. The conversa- 
tional browser 50 comprises a CML parser/processor 
module 52 which parses a CML document and process- 
es the meta-information of the CML document to render 
the document for presentation to the user. The conver- 
sational browser 50 further comprises a command/re- 
quest processor 51 (for example, a command and con- 
trol interface and HTTP server) which interprets user 
commands/requests (multi-modal) such as speech 
commands, DTMF signals, keyboard input, etc. When 
certain conversational functions or services are needed, 
the conversational browser 50 will make API calls to the 
CVM 53 requesting such services (as described below). 
For instance, when interpreting a CML document (via 
the CML parser/processor 52), the conversational 
browser 50 may hook to a TTS (text-to-speech synthesis)
engine 67 (via the CVM shell 53) to provide synthesized
speech output to the user. In addition, when
speech commands or natural language queries (for ex- 
ample, navigation requests) are input, the conversation- 
al browser 50 may hook to a speech recognition engine 
64 and NLU (natural language understanding) engine 
66 to process such input commands, thereby allowing 
the command/request processor 51 to generate the
appropriate requests/queries.

[0055] The CVM shell 53 can run on top of any con- 
ventional OS (operating system) or RTOS (real-time op- 
erating system). A detailed discussion of the architec- 
ture and operation of the CVM shell 53 is provided in the 
above-incorporated International Appln. No. PCT/ 
US99/22927 (and related provisional applications). 
Briefly, as shown in Fig. 5, the CVM shell 53 comprises 
a conversational API layer 54 through which the conver- 
sational browser 50 can "talk" to a CVM kernel layer 55 
to access (via system calls) certain conversational serv- 
ices and behaviours including the conversational en- 
gines 63. The CVM kernel 55 is responsible for allocat- 
ing conversational resources such as engines and ar- 
guments (either local and/or distributed) and managing 
and controlling the dialog and context across multiple 
applications and devices (locally and/or distributed) on
the basis of their registered conversational capabilities/ 
requirements to thereby provide a universal and coordi- 
nated CUI (conversational user interface). The CVM 
shell 53 performs conversational services and functions 
by implementing calls to local conversational engines 
63, for example, a speech recognition engine 64, a 
speaker identification/verification engine 65, an NLU
(natural language understanding) engine 66, a TTS
(text-to-speech) engine 67 (as well as other engines
such as an NLG (natural language generation) engine) 
through a conversational engine API layer 56 (such as 
SAPI, SRAPI, JSAPI, SVAPI or extensions of such
engine APIs). In addition, engine calls can be made to
remote speech engines in distributed topologies. Moreover,
calls to an audio subsystem 62 (providing audio
capture, compression, decompression and reconstruction)
and a DTMF engine 61 may be performed via a
conventional drivers/API layer 60.
[0056] The CVM shell 53 further comprises a
communication stack 57 for providing network communication
via conventional protocols 58 such as TCP/IP, HTTP,
WAP, etc. The communication stack 57 further comprises
conversational protocols 59 (or distributed conversational
protocols) which are utilized for distributed
applications. As described in the above-incorporated
applications, the conversational protocols (or methods) 59
include protocols for (1) discovering network devices
and applications that are "conversationally aware" (that
is, that speak conversational protocols); (2) registering
conversational capabilities (resources) such as
conversational engines and arguments between network
devices; (3) negotiating network configurations (such as
master/slave, peer-to-peer) based on registered
conversational capabilities; (4) exchanging information to
coordinate a conversation between network connected
devices such as information regarding the state, context
and history of a dialog, conversational arguments,
applets, ActiveX components, procedural objects, and other
executable code; and (5) speech coding protocols to
transmit and receive compressed speech (waveforms
or features). These conversational protocols 59, as well
as their role in providing conversational coordination
between networked devices, are described in further detail
in the above-incorporated International Appl. No.
PCT/US99/22925, for example.

[0057] It is to be understood that the engines 63,
DTMF engine 61, conventional drivers/APIs 60 and audio
subsystem 62 illustrated in Fig. 5 are components that
are part of the underlying device, machine or platform
on which the conversational browser 50 and CVM shell
53 are executed. It is to be further understood that the
conversational browser 50 and CVM shell 53 can be
provided as separate systems or, alternatively, the
conversational browser 50 can be implemented as a
stand-alone application carrying its own CVM shell 53 (in
which case the browser and CVM platform would be the
same, that is, indistinguishable entities). In addition, in
the absence of a CVM shell 53 as specifically described
above, it is to be understood that the conversational
browser 50 can incorporate all the functionalities
and features of the CVM shell 53 as discussed (for
example, the conversational browser would make API calls
to appropriate engines locally and/or distributed).
Indeed, the API, services, features, behaviours, access to
engine and communication mechanisms can all be built
directly into, and made part of, the conversational
browser 50 as part of the features and services provided
by the browser.

[0058] Referring now to Fig. 2, a block diagram
illustrates a system according to another embodiment of the
present invention for accessing information using the
conversational portal 11. The system 10 of Fig. 2, which
is an extension of the system depicted in Fig. 1,
additionally provides multi-modal broadcast on demand
services. More specifically, the system of Fig. 2 compris- 
es an audio indexing system 30 that performs, prefera- 
bly, real-time indexing of audio/multimedia documents 
or streamed audio and/or streamed multimedia such as 
broadcast news, radio news programs, and web broad- 
casts that are accessed from certain content providers 
18 over the network 17. Broadcasts can include audio 
and video productions ranging from news to entertain- 
ment (live or prerecorded). The index meta-information 
associated with, for example, a given broadcast or mul- 
timedia document may be stored in a database 31 of 
multi-modal broadcast content. A user can connect to 
the conversational portal 11 using any type of client/ac- 
cess device and search the database 31 using the index 
meta-information to access, for example, desired seg- 
ments of certain broadcasts or audio files. Depending 
on the capabilities of the client/access device, either the 
portal conversational browser 22 can render/present 
any desired segments of streamed video or audio via, 
for example, a plug-in such as a multi-media player (for 
example, RealNetworks player or any other application
that plays IP broadcast streams) or the segments may 
be retrieved and broadcasted/streamed to a client 
browser on the access device for rendering/playback to 
the user. 

[0059] It is to be understood that any suitable conven- 
tional audio indexing system may be employed in the 
system of Fig. 2. A preferred audio indexing system is 
the system disclosed in U.S. Serial No. 09/294,214, filed 
April 16, 1999, entitled: "System and Method for Indexing
and Querying Audio Archives," which is commonly
assigned and incorporated herein by reference. Briefly, 
in one embodiment, the above incorporated audio in- 
dexing system 30 will segment and index an audio or 
multimedia file, or news or radio broadcast, based on, 
for example, audio information such as speaker identity, 
environment, topic, and/or channel, for storage in the 
database 31. Initially, relevant features of an audio file 
or audio data stream (received in real-time) are extract- 
ed and processed to segment the audio data into a plu- 
rality of segments based on, for example, the speech of 
distinct speakers, music, noise, and different
background conditions. For instance, a typical radio
broadcast news report contains speech and non-speech
signals from a large variety of sources including clean
speech, band-limited speech (produced by various
types of microphones), telephone speech, music
segments, speech over music, speech over ambient noise,
speech over speech, etc. For each segment, the audio 
indexing system 30 will identify the particular speaker 
and/or background environment/channel, as well as 
transcribe the spoken utterance and determine the rel- 
evant content/topic of the segment, so as to index the 
segments and detect their topic based on such data. 
[0060] Accordingly, the database 31 may store any
combination of the following meta-information for each
multi-media document/stream: time marks (indicating
the time boundaries of the segments), identity of the
speaker (if meaningful), segmentation of changes of
speakers (if applicable), a transcription of the spoken
portions of the segments, environment information
(music, telephony speech, etc.), the topic of a segment,
boundaries of detected changes of topic, indexes and
attribute value pairs / features (in the maximum entropy
sense) of the segment/story, language and language
boundaries.

[0061] In addition, the audio indexing system 30
comprises an information retrieval system (or search
engine) that utilizes the index meta-information to search
and retrieve desired segments of audio/multimedia files
stored in the database 31. In particular, query
parameters can include any combination of the different
index meta-information such as speaker identity (ID tags),
environment/channel, keywords/content and/or topics/
NLU content, so as to retrieve desired segments from
the database 31.
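
To make the index meta-information and the combined query parameters concrete, the sketch below stores a subset of the fields listed in paragraph [0060] per segment and filters on any combination of them. The record layout and the linear scan are assumptions for illustration; they do not describe the indexing system of the incorporated application.

```python
# Hypothetical sketch: segment records and a combined meta-information query.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    source: str              # broadcast or file the segment came from
    start_s: float           # time marks: segment boundaries in seconds
    end_s: float
    speaker: Optional[str]   # speaker identity, if meaningful
    environment: str         # e.g. "clean", "telephone", "music"
    topic: str
    transcript: str


def search(segments: List[Segment],
           speaker: Optional[str] = None,
           environment: Optional[str] = None,
           topic: Optional[str] = None,
           keyword: Optional[str] = None) -> List[Segment]:
    """Return segments matching every query parameter that was supplied."""
    results = []
    for seg in segments:
        if speaker is not None and seg.speaker != speaker:
            continue
        if environment is not None and seg.environment != environment:
            continue
        if topic is not None and seg.topic != topic:
            continue
        if keyword is not None and keyword.lower() not in seg.transcript.lower():
            continue
        results.append(seg)
    return results


index = [
    Segment("radio-news", 0.0, 42.5, "anchor", "clean", "stock market",
            "Markets closed higher today."),
    Segment("radio-news", 42.5, 90.0, "reporter", "telephone", "weather",
            "Rain is expected tomorrow."),
    Segment("web-cast", 0.0, 30.0, "anchor", "clean", "weather",
            "Sunny skies over the weekend."),
]
```

Calling `search(index, speaker="anchor", topic="weather")` under these assumptions returns only the third record, while `search(index, keyword="rain")` returns only the second.
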

[0062] As indicated above, the conversational portal 
11 can access the servers of content providers 18 to in- 
dex, for example, one or more broadcast news and radio 
news programs in real-time. Such access may be in
response to a user query that is issued upon connection
with the conversational portal 11 to request a search in 
real-time such as for relevant news segments about a 
given topic. For instance, the user can access the con- 
versational portal 11 via the conventional telephone 16 
for example and issue a search request such as for au- 
dio segments of current news regarding the stock mar- 
ket (which search request is interpreted via the speech 
browser 24 and/or portal conversational browser 22). 
The search engine 23 will then access relevant sites to 
retrieve one or more streamed broadcasts, which are 
then segmented and indexed via the audio indexing sys- 
tem 30. A ranked list of segments is rendered and pre- 
sented to the user via conversational dialog through the 
speech browser 24 (assuming user access via the tele- 
phone) or the portal conversational browser 22. 
Through conversational dialog, the user can then select 
the desired segments for playback, and the speech 
browser 24 (or portal conversational browser 22 in the 
case of multi-modal content) plays back the relevant 
segments to the user, without necessarily storing
(long-term) such segments and indexing meta-information in
the database 31 for subsequent access. It is to be
appreciated that by using a multi-modal client/access
device, the user can request multi-modal broadcast on
demand to obtain audio-visual segments of interest and
navigate the multi-modal presentation/stream/broad- 
cast using a conversational/multi-modal user interface. 
[0063] Furthermore, the content providers of such
broadcasts may be affiliated with and otherwise
registered with the service provider of the conversational
portal 11 such that streaming audio/multi-media or other
relevant documents (audio and multi-media) of such
content providers are automatically downloaded and
indexed (on a periodic basis) for subsequent access by
authorized users of the conversational portal 11. In this
manner, a user can connect with the conversational portal
and issue a query to directly search the database 31
and retrieve one or more pre-indexed multi-media
segments having desired content (in lieu of or in addition
to a search over the network). The user can compose a
broadcast program wherein the user may specify the
order in which the different segments are played back/
broadcasted and, for example, listen to the program on
a cell phone or other connected device.
[0064] Furthermore, by periodically downloading and
indexing multi-media documents and/or streaming data,
the conversational portal 11 can provide a service of
composing a personalized "listening and watching"
program for a subscribing user based on user preferences
(for example, pre-selected topics or type of broadcast/
documents/list of interest). The user may also compose
a menu of what the user will listen to. Upon connecting
to the conversational portal 11, the user can access the
personalized program and play back desired content in
any prespecified order. By way of example, a
subscribing user may generate a personalized radio on demand
program which the user can access over a wireless
phone connected to the conversational portal 11. In
addition, it is to be appreciated that during subsequent
searches, the subscribing user may add to his/her
personalized program any additional multi-media
segments that are presented to the user in a search result
list. At any time during the program, the user can use the
portal conversational browser commands to interrupt,
pause or modify the program.
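
A personalized program of this kind can be pictured as an ordered queue that the subscriber may extend or pause at any point. The sketch below is illustrative only: the `Program` class, its field names, and the segment identifiers are hypothetical, and ordering by a user-supplied topic list is just one way to honor a prespecified order.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Program:
    """A subscriber's personalized broadcast program (hypothetical model)."""
    topic_order: list[str]                  # topics in preferred playback order
    queue: list[tuple[str, str]] = field(default_factory=list)  # (topic, id)
    position: int = 0                       # index of the next segment to play
    paused: bool = False

    def add(self, topic: str, segment_id: str) -> None:
        """Add a segment, e.g. one picked from a search result list."""
        self.queue.append((topic, segment_id))
        self._reorder()

    def _reorder(self) -> None:
        # Keep the not-yet-played part sorted by the user's topic order;
        # topics the user never ranked go last.
        rank = {t: i for i, t in enumerate(self.topic_order)}
        played = self.queue[:self.position]
        pending = self.queue[self.position:]
        pending.sort(key=lambda item: rank.get(item[0], len(rank)))
        self.queue = played + pending

    def next_segment(self) -> Optional[str]:
        """Return the next segment id, or None if paused or finished."""
        if self.paused or self.position >= len(self.queue):
            return None
        segment_id = self.queue[self.position][1]
        self.position += 1
        return segment_id


program = Program(topic_order=["news", "sports"])
program.add("sports", "seg-17")
program.add("news", "seg-03")
first = program.next_segment()
program.paused = True            # user interrupts the program
held = program.next_segment()
program.paused = False           # user resumes
second = program.next_segment()
```

Only the unplayed portion of the queue is reordered when a segment is added, so modifying the program mid-session does not replay content the user has already heard.
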

[0065] Referring now to Figs. 3a and 3b, a flow
diagram illustrates a method according to one aspect of the
present invention for accessing information over a
network using a conversational portal. Initially, referring to
Fig. 3a, a user will access a conversational portal using
any type of client/access device (step 100), for example,
a telephone. In a preferred embodiment, upon
connection with the conversational portal, a user identification/
verification process is performed (step 101) to
determine if the user is an authorized user of the
conversational portal. It is to be understood that user identification
is used in cases where personalization and/or login and
billing is involved.

[0066] It is to be understood that any conventional
form of security or logon procedure may be employed.
In a preferred embodiment, a speaker identification and
verification process is performed using the methods
disclosed in U.S. Patent No. 5,897,616 issued April 27,
1999 to Kanevsky, et al., entitled: "Apparatus and
Methods For Speaker Verification/Identification/Classification
Employing Non-Acoustic and/or Acoustic Models
and Databases," which is commonly assigned and the
disclosure of which is incorporated herein by reference.
Briefly, this patent discloses a method for securing
access to a service (such as the conversational portal)
employing automatic speech recognition, text-independent
speaker identification, and natural language
understanding techniques, as well as other dynamic and static
features. In one aspect, the authentication process
includes steps such as receiving and decoding spoken
utterances of the speaker, which contain indicia of the
speaker such as a name, address or customer number;
accessing a database containing information on
candidate speakers; questioning the speaker based on the
information; receiving, decoding and verifying an
answer to the question; obtaining a voice sample of the
speaker and verifying the voice sample against a model;
generating a score based on the answer and the voice
sample; and granting access to the user if the score is
equal to or greater than a threshold.
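
The final two steps, generating a score from the answer and the voice sample and comparing it with a threshold, can be sketched as follows. The text does not say how the two results are combined; the weighted average, the assumption that both inputs lie in the range 0 to 1, and the default weight and threshold values are choices made purely for this example.

```python
def combined_score(answer_score: float, voice_score: float,
                   answer_weight: float = 0.5) -> float:
    """Blend the answer and voice scores (both assumed to lie in [0, 1])."""
    if not 0.0 <= answer_weight <= 1.0:
        raise ValueError("answer_weight must be between 0 and 1")
    return answer_weight * answer_score + (1.0 - answer_weight) * voice_score


def grant_access(answer_score: float, voice_score: float,
                 threshold: float = 0.75) -> bool:
    """Grant access when the combined score meets or exceeds the threshold."""
    return combined_score(answer_score, voice_score) >= threshold
```

With these defaults, a correct answer (1.0) paired with a middling voice match (0.5) yields exactly 0.75 and is accepted, reflecting the "equal to or greater than" condition, whereas two middling scores are rejected.
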
[0067] Alternatively, speaker identification/verification
may be performed via text-independent speaker
recognition in the background of the dialog using the
text-independent speaker verification process based on
frame-by-frame feature classification as disclosed in
detail in U.S. Patent Application Serial No. 08/788,471,
filed on January 28, 1997, entitled: "Text Independent
Speaker Recognition for Transparent Command
Ambiguity Resolution And Continuous Access Control,"
which is commonly assigned and the disclosure of which
is incorporated herein by reference. When speaker
identification, one way or another, is performed, the output
may be processed as if it were a voice cookie. More
specifically, a conventional cookie is a piece of code that
a web site ships to a browser when the browser connects
to the site. The cookie may contain information about the
user's preferences, past usage, etc. It can also contain
digital certificates. Accordingly, speaker ID and
verification can be used to build equivalent information
(a cookie) which can be stored in the portal
conversational or speech browser on the server side.
Thereafter, upon connection to the conversational
portal, user identification may be performed
transparently in the background using the cookie, which is
equivalent to the presentation of a digital certificate. It is
to be understood that, as indicated above, the speaker
identification process may be used for user customization
where, for example, user preferences are set upon
identification and verification of the speaker (for
example, presentation formats, service access, billing
subscription access, modality preferences, etc.). It is to be
understood that any other login, identification, or
authentication procedure may be employed, such as user ID,
password, SIMS number of a GSM cell phone or
conventional cookies in the access client (browser).
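
As a rough illustration of the voice cookie idea, the sketch below keeps a per-speaker record on the server side and returns it whenever the same speaker is verified again. The class names, the speaker identifiers, and the use of a plain dictionary of preferences are assumptions for this example; the text does not prescribe a storage format.

```python
from dataclasses import dataclass, field


@dataclass
class VoiceCookie:
    """Server-side record built after speaker verification (hypothetical)."""
    speaker_id: str
    preferences: dict[str, str] = field(default_factory=dict)


class CookieStore:
    """Keeps voice cookies on the server, keyed by verified speaker ID."""

    def __init__(self) -> None:
        self._cookies: dict[str, VoiceCookie] = {}

    def on_verified(self, speaker_id: str) -> VoiceCookie:
        """Return the speaker's cookie, creating an empty one on first contact."""
        return self._cookies.setdefault(speaker_id, VoiceCookie(speaker_id))


store = CookieStore()
store.on_verified("speaker-42").preferences["modality"] = "speech"
# A later connection by the same verified speaker sees the stored preference.
returning = store.on_verified("speaker-42")
```

Because the record is keyed by the outcome of speaker verification rather than by anything stored on the client, it remains available from a plain telephone that cannot hold a conventional cookie.
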
[0068] If, after the login process, it is determined that
the user is not authorized (negative determination in
step 102), communication between the client and the
portal will be terminated (step 103). If, on the other hand,
it is determined that the user is authorized (affirmative
determination in step 102), the user will be presented
with a plurality of menus (step 104) (via synthesized
speech, for example) associated with the "home page"
of the conversational portal. The initial menu options
may include, for example, searching for content pages
or services (CML or legacy pages/applications),
accessing real-time and prerecorded broadcasts or any legacy
information using transcoding services, and accessing
personalized programs for searching broadcast
segments of interest.
[0069] Depending on the available menu options and
the type of information that the user desires, the user
will issue the appropriate search request (step 105). The
conversational portal 11 (via the portal conversational
browser 22) will interpret the query and provide the
interpreted query to the search engine 23 to perform the
search accordingly (step 106). Again, based on the
selected menu option, the requested search could be, for
example, to retrieve certain WWW or CML content
pages, broadcasts from broadcast-based web sites, or
stored segments of indexed broadcasts. Depending on
the type of search requested, the search engine 23 will
search the WWW, the portal speech directories 26,
and/or the database of indexed broadcasts 31, and
return in CML a ranked list of possible matches (step
107). The ranked list is then rendered back to the user
via the portal conversational browser 22 or speech
browser 24 (step 108).
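
The choice of where to search (step 106) can be viewed as a lookup from the type of request to one or more sources. The mapping below is an illustrative guess: the text says only that the sources consulted depend on the type of search, and the enum members and source labels are names invented for this sketch.

```python
from enum import Enum


class SearchType(Enum):
    CONTENT_PAGES = "content_pages"        # WWW or CML pages and services
    WEB_BROADCASTS = "web_broadcasts"      # live or prerecorded broadcasts
    INDEXED_SEGMENTS = "indexed_segments"  # stored, pre-indexed segments


# Sources consulted for each kind of request (assumed mapping).
SOURCES = {
    SearchType.CONTENT_PAGES: ["www", "portal_directories_26"],
    SearchType.WEB_BROADCASTS: ["www", "portal_directories_26"],
    SearchType.INDEXED_SEGMENTS: ["broadcast_database_31"],
}


def sources_for(search_type: SearchType) -> list[str]:
    """Return the sources the search engine would query for this request."""
    return list(SOURCES[search_type])
```

Keeping the mapping in a table rather than in branching code makes it straightforward to let a single request consult several sources, as the "and/or" in the description allows.
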

[0070] Assuming the user requested a search for a
particular web document (or service), the ranked list will
contain a list of web sites from which the user can select
to download the document. If the user does not desire
to retrieve a particular document from the list (negative
decision in step 109), the user may either continue with
an additional search (affirmative result in step 110 and
return to step 104) or disconnect from the
conversational portal (step 103). If, on the other hand, the user
desires to retrieve a particular document from the list
(affirmative decision in step 109), the user can issue an
appropriate multi-modal command (voice or mouse
click) to retrieve a desired document (step 111). The
conversational browser will generate and transmit an
appropriate request to download the desired document
from the corresponding content server 18.
[0071] In the preferred embodiment, if the desired
document is in a presentation format (such as HTML)
other than CML (negative result in step 112), the
document is transmitted to the appropriate transcoder to
convert the document into an appropriate CML format (step
113), which is then rendered for playback to the user via
a conversational browser running on the client, on the
server, or on both in a distributed topology (step 114). As
indicated above, the transcoder may reside, for example,
in the conversational portal 11 server or a proxy server
associated with, for example, the content server from
which the document is retrieved. Alternatively, in the
case of a legacy client browser, the retrieved document
may be transcoded to the appropriate modality (for
example, a CML or HTML document may be transcoded
to a VoiceXML document for rendering on a client
speech browser). It is to be understood that a retrieved
document in a streaming audio/multi-media format is not
converted to CML or any other legacy ML. If, on the other
hand, the presentation format of the requested
document is CML, the document is transmitted directly to
the conversational browser (client, server or both
(distributed)) for rendering (step 114).
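
The format decision of steps 112 to 114 can be summarized in a few branches. The following is a sketch under stated assumptions: the format labels ("cml", "html", "stream"), the flag identifying a legacy speech browser, and the returned descriptions are invented for illustration and do not correspond to any actual interface of the portal.

```python
def delivery_plan(doc_format: str, client_is_legacy_speech: bool = False) -> str:
    """Decide how a retrieved document reaches the browser.

    doc_format is "cml", "html" or "stream" (labels chosen for this sketch).
    """
    if doc_format == "stream":
        # Streaming audio/multi-media is never converted to a markup language.
        return "play stream as-is"
    if client_is_legacy_speech:
        # Legacy speech browser: CML or HTML is transcoded to VoiceXML.
        return "transcode to VoiceXML"
    if doc_format == "cml":
        # Affirmative result in step 112: go straight to rendering (step 114).
        return "render CML directly"
    # Negative result in step 112: transcode (step 113), then render (step 114).
    return "transcode to CML, then render"
```

The streaming case is tested first because it bypasses transcoding regardless of the client, which mirrors the statement that such documents are not converted to CML or any other legacy ML.
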
[0072] Returning again to step 108, assuming the
user requested a search for a particular web broadcast
(live or prerecorded broadcasts of radio or video
presentations), the ranked list will contain a list of web sites
that offer such broadcasts from which the user can
select a desired broadcast (step 115, Fig. 3b). If the user
does not desire to retrieve a particular broadcast in the
list (negative decision in step 115), the user may either
continue with an additional search (affirmative result in
step 119 and return to step 104, Fig. 3a) or disconnect
from the conversational portal (step 103, Fig. 3a). If, on
the other hand, the user desires to download a particular
broadcast in the list (affirmative decision in step 115),
the user can issue an appropriate (multi-modal)
command to download the desired broadcast (step 116).
The portal conversational browser 22 will generate and
transmit an appropriate request to connect to the
content server providing the desired broadcast (step 117).
Optionally, the user can issue a command to have the
broadcast indexed (via the audio indexing system 30)
for playback and search at a later time (step 118).
[0073] Returning again to step 108, assuming the
user requested a search for certain prestored/indexed
segments of web broadcasts, the ranked list will contain
a list of available segments (audio/audio-visual) from
which the user can select (step 120, Fig. 3b). If the user
does not desire to retrieve any of the listed segments
(negative decision in step 120), the user may either
continue with an additional search (affirmative result in step
119 and return to step 104, Fig. 3a) or disconnect from
the conversational portal (negative result in step 119
and return to step 103, Fig. 3a). If, on the other hand,
the user desires to play back one or more segments in
the list (affirmative decision in step 120), the user can
issue an appropriate (multi-modal) command to
download such segment(s) (step 121). Using appropriate
plug-ins, the portal conversational browser 22 or speech
browser 24 will play back the selected segments to the
user (step 122). Optionally, using the appropriate
plug-ins, the user can issue commands to control the
playback of the segments (such as fast forward, rewind and
search).

[0074] In summary, the present invention
advantageously affords conversational (multi-modal) access to
the WWW, for example, from anywhere at any time
through any appropriate connected device so as to
extract desired information and/or build a personalized
broadcast program on demand, as well as manage and
modify the program at any time. It is to be appreciated
that the present invention provides multiple advantages
over conventional systems. For instance, the present
invention allows a user to perform multi-modal searches
of real-time and prerecorded broadcasts and select
segments on topics of interest for multi-modal playback.
Another advantage is that it further allows a user to access
documents and services in any format (CML or legacy)
regardless of the I/O capabilities of the client/access
device. Indeed, the retrieved pages may be in CML format
or converted to CML format on-the-fly for rendering by
a conversational (multimodal) browser.
[0075] Furthermore, the present invention allows a
user to generate programs that he/she will follow and
allows the user to interrupt or modify the program at
any time. In addition, the user can search for alternatives
while watching or listening to a given segment
(background or off line search). Another advantage is that the
present invention provides a service that allows a user,
via, for example, a single phone number, to access
broadcast on demand from anywhere at any time.
Indeed, with the expansion of wireless networks, such
service can be accessed via any wirelessly connected
device. The conventional services described above do
not offer such capabilities. Indeed, broadcast on
demand and true interactive programming are a
long-standing need that, until this invention was proposed, had
not been appropriately satisfied by any of the conventional
systems described above.

[0076] Moreover, with respect to a business aspect of
the present invention, there are a variety of viable
business models. As indicated above, the conversational
portal service can be subscription based, with revenue
being generated from various channels. For instance,
companies or content providers may register with the
service provider of the conversational portal to be part
of the manually managed portal directories 26 upon
payment of an appropriate fee. In addition, revenue may be
generated through user subscription, for example, a flat
rate or a fee per usage which then requires billing. Billing
can then be performed knowing the user (ID of the
connection browser, calling phone or biometric/verification
or login to the conversational portal). In addition,
payment/revenue for the conversational portal can be
obtained directly via agreement with the channel carrier
(for example, telephony carrier, wireless carrier or ISP).
[0077] In addition, another business model is to have
the conversational portal open to everybody for
conversational access to content pages, service and broadcast
content. In such a case, revenue may be generated from
fees that are paid by subscribing users/companies for
advertisements and/or other services provided by the
conversational portal 11 on behalf of the subscribing
user/company. For instance, the call capture option of the
conversational portal can provide a direct revenue
stream by providing advertisements (banners) in
between fetches that are made via the portal (for example,
when a new search is performed).
[0078] Moreover, by continuously listening to the
conversation (call capture), the conversational portal can be
the primary mechanism by which the user can access
other services (such as universal messaging, e-mail,
directory assistance, map/traffic assistance etc.), wherein
the service provider of such services will pay an extra fee
to be prominently available at that level (instead of being
accessible through more advanced menu search from
the portal). This "capture" mechanism of the
conversational portal significantly increases the average time
spent by the user on the portal (as opposed to
conventional portals that interact with the user only during the
short time that the user enters a query and decides to
follow a resulting link to a new site). Also, by offering such
services (which are always accessible), the portal
significantly increases the chances that the user will
connect to the conversational portal when access to one of
the services is desired.



Claims 

1. A conversational portal, comprising: 

a conversational browser for conducting mul- 
ti-modal dialog with clients having varying input/out- 
put (I/O) modalities, wherein the conversational 
browser retrieves information from an information 
source in response to a request from a requesting 
client and one of serves and presents the retrieved 
information to the requesting client in a format that 
is compatible with the I/O modalities of the request- 
ing client. 

2. The conversational portal of claim 1, wherein the 
information provided by the information sources is 
implemented in a multi-modal representation. 

3. The conversational portal of claim 1 or claim 2, 
wherein the multi-modal representation is a modal- 
ity-independent format. 

4. The conversational portal of any of claims 1 to 3, 
further comprising a transcoder, operatively associ- 
ated with the conversational browser, for converting 
the multi-modal information into at least one modal- 
ity-specific format based on the I/O modalities of the 
requesting client. 

5. The conversational portal of any preceding claim, 
wherein the conversational portal detects the I/O 
modalities of the requesting client to convert the 
multi-modal information into the at least one
modality-specific format.

6. The conversational portal of any preceding claim, 
wherein the conversational portal detects the I/O 
modalities of the requesting client based on one of 
registration protocols and identification of the ac- 
cess channel. 

7. The conversational portal of any preceding claim, 
further comprising a portal directory database, ac- 
cessible by the conversational browser, for storing 
one of an index of information sources, information 
associated with information sources, and a
combination thereof.

8. The conversational portal of any preceding claim, 
further comprising a capture module for capturing a 
connection between the requesting client and the 
conversational portal and holding the client captive 
during predetermined time periods. 

9. The conversational portal of claim 8, wherein the 
client is held captive between a time period where 
a link provided by the conversational browser is
selected by the requesting client and one of rendered
and served to the requesting client.

10. The conversational portal of claim 8 or claim 9, 
wherein the requesting client is released when a link 
is directly requested by the requesting client. 

11. The conversational portal of any of claims 8 to 10, 
wherein a service provider of the conversational 
portal provides one of advertisements, services and 
a combination thereof, during at least one predeter- 
mined time period in which the requesting client is 
held captive. 

12. The conversational portal of claim 11, wherein the 
at least one predetermined time period is a time pe- 
riod between fetching links between different infor- 
mation sources. 

13. The conversational portal of claim 11 or claim 12, 
wherein the advertisements and services are multi- 
modal. 

14. The conversational portal of any preceding claim,
further comprising:

an audio indexing system for segmenting and
indexing audio and multimedia data obtained
from an information source; and

a multimedia database for storing the indexed
audio and multimedia data.

15. The conversational portal of claim 14, wherein the
conversational browser obtains desired segments
from the multimedia database in response to a
client request and presents such segments to the
client based on the I/O capabilities of the client.

16. The conversational portal of claim 14 or 15, wherein
the conversational browser periodically downloads
multimedia data from at least one information
source to index and store the multimedia data in the
multimedia database.

17. The conversational portal of any of claims 14 to 16,
wherein the conversational portal maintains, for a
registered subscriber, a program comprising
user-selected multimedia segments in the multimedia
database.

18. The conversational portal of claim 17, wherein the
registered subscriber can conversationally
navigate the program and select desired segments for
broadcasting via the requesting client.

19. The conversational portal of claims 17 or 18,
wherein the program comprises radio on demand service
which the registered subscriber accesses via a
wireless phone.

20. A system for accessing information, comprising:

an access device having at least one
input/output modality;

a content server; and

the conversational portal of any preceding
claim.

21. A method for providing access to information over
a communications network, comprising the steps
of:

establishing communication with a
conversational portal using an access device having at
least one input/output modality associated
therewith;

retrieving, by the conversational portal,
information in response to a user request;

one of presenting and serving, by the
conversational portal, the information to the user
based on the at least one I/O modality of the
access device.